All Similarity Calculations

Synergy of severe climate change and extinctions results in taxonomic homogenization in deep time

Author
Published

2026-01-08

Introduction

Code compiled and curated by anonymous.

This code is used to calculate the "Before" and "After" similarity values for each interval set, and can be customized for an interval and/or dataset of your choosing. Using this script, the calculations for all intervals in our analysis are nearly identical; only the interval names and ages differ. Here, we provide a worked example using the calculations for the Late Ordovician mass extinction.

Libraries

Code
rpkg <- c("dplyr ggplot2 readr boot divvy terra divDyn conflicted piggyback CoordinateCleaner fossilbrush rgplates icosa tidyr tibble purrr downloadthis ggpubr")

import_pkg <- function(x)
  x |> trimws() |> strsplit("\\s+")  |> unlist() |> 
  lapply(function(x) library(x, character.only = T)) |> 
  invisible()

rpkg |> import_pkg()

# Resolve conflicted functions.
conflicted::conflict_prefer(name = "filter", winner = "dplyr", losers = "stats")


Custom functions

Most of the functions we created for this script are stored here.

Code
#' Great circle distance via the spherical law of cosines.
#' @param long1,lat1,long2,lat2 Coordinates in decimal degrees (converted to radians internally).
#' @return Distance in kilometers (km), using Earth's mean radius R = 6371 km.
#' @export

gcd.slc <- function(long1, lat1, long2, lat2) {
  R <- 6371
  long1.r <- long1*pi/180
  long2.r <- long2*pi/180
  lat1.r <- lat1*pi/180
  lat2.r <- lat2*pi/180
  d <- acos(sin(lat1.r)*sin(lat2.r) + cos(lat1.r)*cos(lat2.r) * cos(long2.r-long1.r)) * R
  return(d) 
  }
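
# Quick sanity check of the spherical law of cosines above: the distance
# from (0, 0) to (90E, 0) along the equator is one quarter of Earth's
# circumference.
quarter_circ <- acos(cos(90 * pi / 180)) * 6371 # same as gcd.slc(0, 0, 90, 0)
quarter_circ # ~10007.5 km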

# Calculate the Jaccard similarity coefficient for every unique cell pair.

jaccard_similarity <- function(x) {
  js_table <- list()
  for (k in seq_along(x)) {
  
  # Unique cells.
  unique_cells <- unique(x[[k]]$cell)
  jaccard_similarity_table <- data.frame(cell_x = character(), cell_y = character(), jaccard_similarity = numeric(), stringsAsFactors = F)
  
  for (i in seq_along(unique_cells)) {
    cell_x <- unique_cells[i]
    # Cell_x
    unique_names_cell_x <- unique(x[[k]]$accepted_name[x[[k]]$cell == cell_x])
    
    for (j in seq_along(unique_cells)) {
      cell_y <- unique_cells[j]
      
      # Duplicate comparisons.
      if (cell_x == cell_y || cell_x > cell_y) {
        next
      }
      
      # Cell_y
      unique_names_cell_y <- unique(x[[k]]$accepted_name[x[[k]]$cell == cell_y])
      # Intersections.
      intersection <- length(generics::intersect(unique_names_cell_x, unique_names_cell_y))
      Un <- length(generics::union(unique_names_cell_x, unique_names_cell_y))
      jaccard_similarity <- intersection/Un
      # Combine results.
      jaccard_similarity_table <- rbind(jaccard_similarity_table, data.frame(cell_x = cell_x, cell_y = cell_y, jaccard_similarity = jaccard_similarity))
    }
  }
  
  # Results.
  js_table[[k]] <- jaccard_similarity_table 
  }
  return(js_table)
}
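
# Worked toy example of a single (cell_x, cell_y) comparison above, using
# hypothetical genus labels; base::intersect()/union() behave the same as
# the generics:: versions on character vectors.
names_x <- c("genusA", "genusB")
names_y <- c("genusA", "genusB", "genusC")
length(intersect(names_x, names_y)) / length(union(names_x, names_y)) # 2/3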

# Calculate the Czekanowski similarity coefficient.
czekanowski_similarity <- function(x) {
  2*abs(sum(x$minimum))/((sum(x$count_cell_x) + sum(x$count_cell_y)))
}
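
# Worked toy example with hypothetical counts for one cell pair; the fields
# mirror the minimum / count_cell_x / count_cell_y columns built below.
toy <- data.frame(count_cell_x = c(3, 0, 2), count_cell_y = c(1, 4, 2))
toy$minimum <- pmin(toy$count_cell_x, toy$count_cell_y)
cz_toy <- 2 * abs(sum(toy$minimum)) / (sum(toy$count_cell_x) + sum(toy$count_cell_y))
cz_toy # 2 * 3 / (5 + 7) = 0.5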

# Cross-join function.
gridComb <- function(x, cell, accepted_name) {
  cA <- expand.grid(cell = unique(x$cell), unique(x$accepted_name)) |> setNames(nm = c("cell","accepted_name"))
  return(cA)
}

# Count taxon occurrence per unique cell combination.
czekanowski_data_prep <- function(x, cell, accepted_name) {  
  
  count_taxa_x <- x |> 
    group_by(cell, accepted_name) |>
    summarize(count = n(), .groups = 'drop') |> rename("cell_x" = "cell", "count_cell_x" ="count")
  
  count_taxa_y <- x |> 
    group_by(cell, accepted_name) |>
    summarize(count = n(), .groups = 'drop') |> rename("cell_y" = "cell", "count_cell_y" ="count")

  # Cell pairs.
  cell <- unique(x[[cell]])
  taxa <- unique(x[[accepted_name]])
  
  cell_combinations <- expand.grid(cell_x = cell, cell_y = cell,accepted_name = taxa) |>  filter(cell_x != cell_y)
  
  result <- cell_combinations |> 
    left_join(count_taxa_x, by = c("cell_x","accepted_name"), relationship = "many-to-many") |> 
    # Second join (y) 
    left_join(count_taxa_y, by = c("cell_y", "accepted_name")) |> 
    select("cell_x", "cell_y", "accepted_name", "count_cell_x", "count_cell_y") |>
    # Replace NA with 0
    replace_na(replace = list(count_cell_x = 0, count_cell_y = 0)) |> 
    # Remove rows that at 0 in both count fields.
    filter(!(count_cell_x == 0 & count_cell_y == 0)) |> 
    # Remove duplicated cell combinations
    filter(cell_x == cell_y | cell_x > cell_y) |> 
    # Split by cell combination
    group_split(cell_x,cell_y)
    
    return(result)
}

Paleobiology Database

The fossil occurrence data analysed in this study were retrieved from the Paleobiology Database in November 2024. Data pre-processing used functions from the fossilbrush and CoordinateCleaner R packages.

The following script shows how we processed the full Phanerozoic PBDB dataset downloaded with the Pbdb_download_new.R script, which is available in our repository.

However, since that route requires you to download the data yourself, we also provide the PBDB data files used for each interval pair in the Pbdb_data folder, so this section of the script can be skipped. For the Late Ordovician mass extinction, the file is labelled pbdb_lome.csv within that folder. You therefore have the option to simply load that file and skip this section if you so choose!

Code
# Load the csv file directly from our repository: 
pbdb <- read.csv("https://raw.githubusercontent.com/Anonymous-paleobio/Taxonomic-Homogenization-DeepTime/refs/heads/main/Pbdb-files/pbdb_lome.csv")
pbdb <- pbdb[, -c(1, 22, 21)]


 
# If you want to process the raw Phanerozoic-level dataset from step 01 yourself, follow the rest of this chunk instead and uncomment it using Ctrl (or Cmd) + Shift + C.
 
 
# Read occurrence dataset you generated in step 01. Replace the directory with the one of where you stored it:
# pbdb <-read.csv(file = '~/Documents/BioHom/pbdb.data.Nov2024.csv')

 
# # Adjust radiometric ages
# interval.ma    <- pbdb |> 
#   group_by(early_interval) |> 
#  summarise(min_ma = min(min_ma))
# names(interval.ma) <-c("early_interval", "interval.ma")
# pbdb       <- merge(pbdb, interval.ma, by=c("early_interval"))
# 
# # Find first and last occurrences and merge back into data frame, using min_ma column
# fadlad <- pbdb |> 
#   group_by(accepted_name)  |> 
#   summarise(
#     fad = max(interval.ma),
#     lad = min(interval.ma)
#   )
# 
# # Merge fad and lad information into data frame
# pbdb <- merge(pbdb, fadlad, by=c("accepted_name"))
# 
# # Add extinction/survivor binary variable
# pbdb$ex <- 0
# pbdb$ex[pbdb$interval.ma==pbdb$lad] <- 1
# 
# # Select variables.
# pbdb <- pbdb |> 
#   select(any_of(c("interval.ma","early_interval","interval.ma","fad","lad",
#                   "accepted_name","genus","ex","phylum","class",
#                   "order","family","paleolat","paleolng","formation","member",
#                   "occurrence_no","collection_no","collection_name",
#                   "reference_no")))
# 
# # Keep two classes and select the age-pair you want.
#  pbdb <- pbdb |> 
#      filter(class %in% c("Gastropoda", "Bivalvia", "Trilobita", "Rhynchonellata", "Strophomenata", "Anthozoa") &
#     interval.ma %in% c(445.2, 443.8, 440.8))
#  
# # Identify Invalid Coordinates.
#   cl <- cc_val(pbdb, value = "flagged", lat="paleolat", lon  ="paleolng") #flags incorrect coordinates
#   cl_rec <- pbdb[!cl,] #extract and check them
#   
#  pbdb <- pbdb |> 
#    cc_val(lat = "paleolat", lon="paleolng") #remove them
#  
# # Use fossilbrush to clean taxonomic errors
# b_ranks <- c("phylum", "class", "order", "family", "accepted_name") #accepted_name is genus name
# 
# # Define a list of suffixes to be used at each taxonomic level when scanning for synonyms
# b_suff = list(NULL, NULL, NULL, NULL, c("ina", "ella", "etta"))
# 
# pbdb2 <- check_taxonomy(pbdb, suff_set = b_suff, ranks = b_ranks, verbose = FALSE,clean_name = TRUE, resolve_duplicates = TRUE, jump = 5)
# # resolves homonyms, and jump refers to which taxonomic rank is the highest we resolve to. jump = 5 will stop before phylum since phylum level usually has little error.
# 
# # Extract PBDB data from pbdb2 so we have the corrected taxa:
# pbdb <- pbdb2$data[1:nrow(pbdb),]
# 
# pbdb_fulldata <- pbdb # keep a record of all pertinent information, just in case

Visualization of Cells and Occurrences

The globe is divided into a grid of equal-area icosahedral hexagonal cells using the hexagrid() function in icosa. In hexagrid(deg = x), deg is roughly equivalent to longitudinal degrees, so deg = 1 corresponds to roughly 111 km. This sets the tessellation vector, which determines the area of each cell. In our specified grid, each cell is roughly 629,000 km^2, resulting in a grid of 812 cells.
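The quoted cell area can be sanity-checked with a quick back-of-the-envelope calculation (a sketch assuming Earth's mean radius of 6,371 km): dividing the surface area of the sphere by the grid's 812 faces gives the mean cell area.

Code
mean_cell_area <- 4 * pi * 6371^2 / 812 # sphere surface area / number of faces
round(mean_cell_area) # ~628,000 km^2, consistent with the ~629,000 km^2 quoted above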

Code
# Use this chunk of code if you are processing the raw dataset yourself. Otherwise, skip it!

pbdb.2before <- pbdb |> filter(interval.ma==445.2)
pbdb.2after <- pbdb |> filter(interval.ma==443.8)
pbdb.2after2 <- pbdb |> filter(interval.ma == 440.8)

# Find raw locations for each stage:
coords.before <- subset(pbdb.2before, select = c(paleolng, paleolat)) 
coords.before <- coords.before |>  
                          mutate_at(c('paleolng', 'paleolat'), as.numeric)
coords.after<- subset(pbdb.2after, select = c(paleolng, paleolat)) 
coords.after <- coords.after |>  
                          mutate_at(c('paleolng', 'paleolat'), as.numeric)
coords.after2<- subset(pbdb.2after2, select = c(paleolng, paleolat)) 
coords.after2<- coords.after2 |>  
                          mutate_at(c('paleolng', 'paleolat'), as.numeric)

# Set up the grid
hexa <- hexagrid(deg= 4.5, sf=TRUE) #each deg = ~111 km
hexa
A/An hexagrid object with 1620 vertices, 2430 edges and 812 faces.
The mean grid edge length is 493.6 km or 4.44 degrees.
Use plot3d() to see a 3d render.
Code
# Find cell locations for each occurrence
cells.before <-locate(hexa, coords.before) 
# str(cells.before) #to see which cells have occ's
cells.after <-locate(hexa, coords.after)
#str(cells.after)
cells.after2 <-locate(hexa, coords.after2)


# Next add cells df to coords df in order to match cells with their coordinates:
coords.before$cell <- cells.before 
names(coords.before) <- c("long", "lat", "cell")
coords.after$cell <- cells.after 
names(coords.after) <- c("long", "lat", "cell")
coords.after2$cell <- cells.after2
names(coords.after2) <- c("long", "lat", "cell")

tcells.before <- table(cells.before) #to get no. of occupied cells
#str(tcells.cha) #get frequency of cell occ's
tcells.after <- table(cells.after)
#str(tcells.ind)
tcells.after2<- table(cells.after2)

data.2before <- cbind(pbdb.2before, coords.before) #assigns cell number for each occurrence
data.2after <- cbind(pbdb.2after, coords.after)
data.2after2 <- cbind(pbdb.2after2, coords.after2)

# pbdb <- rbind(data.2before, data.2after, data.2after2) # use this line only if you are processing the raw dataset yourself.

Grid Plots

Next, we visualize all occurrences for each stage using the rgplates and icosa packages in R. Please note that this workflow requires the GPlates software (v2.5.0 as of writing this script) to be installed on your computer, as the offline plate model is the most reliable way to use rgplates.

Code
# Call to Gplates offline (requires installed Gplates software)

td <-tempdir() #temporary directory
#td
rgPath <- system.file(package="rgplates")
#list.files(rgPath) #confirm that this is the correct path
unzip(file.path(rgPath, "extdata/paleomap_v3.zip"), exdir=td)
#list.files(file.path(td)) #confirm extraction has happened by looking at temporary directory
pathToPolygons <- file.path(td, "PALEOMAP_PlatePolygons.gpml") #static plate polygons
pathToRotations <- file.path(td, "PALEOMAP_PlateModel.rot")

pm <- platemodel(
  features = c("static_polygons" = pathToPolygons),
  rotation = pathToRotations
)

# Plot it out:
edge <-mapedge() #edge of the map
plates.lome<- reconstruct("static_polygons", age= 440, model =pm)
plot(edge, col = "lightblue2")
plot(plates.lome$geometry, col = "gray60", border = NA, add = TRUE)
plot(hexa,  border="white",add = TRUE)
gridlabs(hexa, cex=0.5) #get labels for each cell, labeled as spiral from North pole of grid

Before occurrences

Occurrences in the Before stage, with colors indicating the number of occurrences in occupied cells.

Code
# Before
platesMoll <- sf::st_transform(plates.lome, "ESRI:54009")
#^transform plates to Mollweide projection to plot
plot(hexa, tcells.before, 
     crs ="ESRI:54009",
     border = "lightgrey",
     pal=c("#440154ff","darkorchid2","deepskyblue","royalblue", "goldenrod"),
     breaks = c(0, 20, 100, 500, 1000, 2000),
     reset=FALSE)
plot(platesMoll$geometry, add = TRUE,border= NA, col = "#66666688")

After occurrences - Pulse 1

Occurrences in the After (Pulse 1) stage, with colors indicating the number of occurrences in occupied cells.

Code
# After
plot(hexa, tcells.after, 
     crs ="ESRI:54009",
     border = "lightgrey",
     pal=c("#440154ff","darkorchid2","deepskyblue","royalblue", "goldenrod"),
     breaks = c(0, 20, 100, 500, 1000, 2000),
     reset=FALSE)
plot(platesMoll$geometry, add = TRUE,border= NA, col = "#66666688")

After occurrences - Pulse 2

Occurrences in the After (Pulse 2) stage, with colors indicating the number of occurrences in occupied cells.

Code
# After
plot(hexa, tcells.after2, 
     crs ="ESRI:54009",
     border = "lightgrey",
     pal=c("#440154ff","darkorchid2","deepskyblue","royalblue", "goldenrod"),
     breaks = c(0, 20, 100, 500, 1000, 2000),
     reset=FALSE)
plot(platesMoll$geometry, add = TRUE,border= NA, col = "#66666688")

Code
#save as landscape,10*6 

Data pre-processing

We investigate the data by dividing it by stage and taxonomic class, and determine the number of cells and occurrences for each stage.

Code
# Data balance.
pbdb |> 
  arrange(early_interval) |> 
  mutate(early_interval = factor(early_interval, levels=c("Katian", "Hirnantian", "Rhuddanian"))) |> # reorder the intervals
  group_by(class,early_interval) |> 
  count() |> 
  ggplot(mapping = aes(x = class, y = n, fill = class)) +
  geom_bar(stat = "identity") +
  labs(x = NULL, y = "Sample Size") +
  scale_fill_manual(values =  c("#FFBF00","#0072B2","#D5006D","#009E73","black","#984EA3", "#003F5C"))+
  scale_color_manual(values = c("#FFBF00","#0072B2","#D5006D","#009E73", "black","#984EA3", "#003F5C")) +
  facet_wrap(.~ early_interval, scales = "free") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, size = 15, face = "bold"),
        axis.title = element_text(size = 12,face = "bold"),
        axis.text.x = element_text(size = 12, angle=45, hjust=1),
        axis.text.y= element_text(size=12),
        strip.text = element_text(face = "bold"),
        legend.position = "none",
        aspect.ratio = 0.8)

Code
# Set min occurrences
min_occ <- 15

# Katian cells.
 before_pbdb <-
  pbdb |> 
  filter(early_interval == "Katian")

# before_pbdb <- 
#  before_pbdb |> 
  # Cells.
#  mutate(cell = locate(x = hexa,y = before_pbdb |> select("paleolng", "paleolat")))

# Count occurrences per cell and filter by minimum occurrence.
# before_pbdb <-
#  before_pbdb |> 
#  group_by(cell) |>
#  count() |> 
#  setNames(nm = c("cell","occs")) |> 
#  inner_join(before_pbdb, by = c("cell")) |>
#  filter(occs >= min_occ)

# Cell centroids.
# before_centroid <- 
# as.data.frame(centers(hexa))[names(table(before_pbdb$cell)),] |> 
#  rownames_to_column(var = "cell")

# Add centroid to master dataframe: Longitude and Latitude.
# before_pbdb <- 
#  before_pbdb |> 
#  left_join(before_centroid, by = "cell")

# Hirnantian (after) cells.
 after_pbdb <-
  pbdb |> 
  filter(early_interval == "Hirnantian")

# after_pbdb <- 
# after_pbdb |> 
  # Cells.
#  mutate(cell = locate(x = hexa,y = after_pbdb |> select("paleolng", "paleolat")))

# Count occurrences per cell and filter by minimum occurrence.
# after_pbdb <-
#  after_pbdb |> 
#  group_by(cell) |>
#  count() |> 
#  setNames(nm = c("cell","occs")) |> 
#  inner_join(after_pbdb,by = c("cell")) |>
#  filter(occs >= min_occ)

# Cell centroids
# after_centroid <- 
#  as.data.frame(centers(hexa))[names(table(after_pbdb$cell)),] |> 
#  rownames_to_column(var = "cell")

# Add centroid coordinates to master dataframe.
# after_pbdb <- 
#  after_pbdb |> 
#  left_join(after_centroid, by = "cell")

## After2
 after_pbdb2 <-
  pbdb |> 
  filter(early_interval == "Rhuddanian")

# after_pbdb2 <- 
#  after_pbdb2 |> 
  # Cells.
#  mutate(cell = locate(x = hexa,y = after_pbdb2|> select("paleolng", "paleolat")))

# Count occurrences per cell and filter by minimum occurrence.
# after_pbdb2 <-
#  after_pbdb2 |> 
#  group_by(cell) |>
#  count() |> 
#  setNames(nm = c("cell","occs")) |> 
#  inner_join(after_pbdb2,by = c("cell")) |>
#  filter(occs >= min_occ)

# Cell centroids
# after_centroid2 <- 
#  as.data.frame(centers(hexa))[names(table(after_pbdb2$cell)),] |> 
#  rownames_to_column(var = "cell")

# Add centroid coordinates to master dataframe.
# after_pbdb2 <- 
#  after_pbdb2 |> 
#  left_join(after_centroid2, by = "cell")

# Combine the two datasets: before & after.
# The pbdb dataset has now been fully pre-processed.
# pbdb <- bind_rows(before_pbdb, after_pbdb, after_pbdb2)

# Create unique identifier for each cell.
# pbdb <- 
#  data.frame(unique(pbdb$cell)) |> 
#  setNames(nm = "cell") |> 
#  mutate(cell_id = c(1:length(cell))) |> 
#  inner_join(pbdb, by = "cell")

# Get number of cells for each age
pbdb |> group_by(interval.ma) |> summarise(unique_cells = n_distinct(cell))
# A tibble: 3 × 2
  interval.ma unique_cells
        <dbl>        <int>
1        441.           15
2        444.           23
3        445.           63
Code
# Plot number of occurrences per stage and cell.
cell_text <- 
  data.frame(
  label = c("N = 23 cells", "N = 62 cells", "N = 15 cells"),
  early_interval = c("Hirnantian", "Katian", "Rhuddanian")
)

# Plot it
pbdb |> 
   group_by(early_interval,cell) |> 
    count() |>
 mutate(early_interval = factor(early_interval, levels=c("Katian", "Hirnantian", "Rhuddanian"))) |> # reorder the intervals
  ggplot(mapping = aes(x = cell, y = n)) + 
  geom_col(col = "white", bg = "#53565A") +
  coord_flip() +
  geom_hline(yintercept = 15, color = "#B83A4B") +
  labs(x = NULL, y = "Occurrences") +
  geom_text(data = cell_text, mapping = aes(x = c(6,12,18), y = 100, label = label),
            hjust   = -1, vjust = -0.1, size = 3) +
  facet_wrap(.~ early_interval,scales = "free",nrow = 1) +
  theme_bw() +
  theme(aspect.ratio = 1.25,
        axis.text  = element_text(size = 8),
        axis.title = element_text(face = "bold"),
        strip.text = element_text(face = "bold"))

For each stage we create individual dataframes based on the cell units and store these into separate lists.

Code
# Data splitting based on cell id and stage.
before_split <-
  pbdb |> 
  filter(early_interval == "Katian") |>
  group_split(cell_id) |> 
  lapply(as.data.frame)

after_split <-
  pbdb |> 
  filter(early_interval == "Hirnantian") |>
  group_split(cell_id) |> 
  lapply(as.data.frame)

after_split2 <-
  pbdb |> 
  filter(early_interval == "Rhuddanian") |>
  group_split(cell_id) |> 
  lapply(as.data.frame)

Subsampling by cells and occurrence

Here we perform subsampling without replacement on our stage-level datasets over 99 iterations. For the "before" age we randomly sample 15 occurrences per cell and repeat the process as stated above. For the "after" age, we apply a two-step subsampling procedure: first subsampling down to match the number of cells, then subsampling by occurrences. The results are cell-specific subsampled datasets saved as nested objects within a larger list. These are subsequently merged into single master dataframes (i.e., the cells) to create one list containing 99 dataframes.

Code
# after.
set.seed(31)

boot_after2 <- purrr::map(1:99, ~ {
  after_split2 |> 
  # Samples rows uniformly.
 purrr::map(~ sample_n(.x, 15, replace = FALSE))
  }
)

set.seed(3)

boot_after <- purrr::map(1:99, ~ {
  after_split |> 
  # Samples rows uniformly.
 sample(15, replace = FALSE) |> 
 purrr::map(~ sample_n(.x, 15, replace = FALSE))
  }
)

# before.
set.seed(4)

boot_before <- purrr::map(1:99, ~ {
  before_split |> 
  # Step 1. Cells.
  sample(15, replace = FALSE) |> 
  # Step 2. Rows (i.e., occurrences).
  purrr::map(~ sample_n(.x, 15, replace = FALSE))
  }
)

As indicated in the previous section, here we combine the cell-specific dataframes (N = 13) into single joint dataframes (13 * 20 = 260 rows). This is repeated for all 99 subsampled dataframes. Of note, the cells in the "after" list will inevitably vary between subsampled datasets, whereas for the "before" age they are all identical. This is because our analysis seeks to assess the impact of cell heterogeneity across geologic stages.

Code
# Before.
combined_boot_before <- 
  list()

for(i in seq_along(boot_before)) {
  pBe <- purrr::map_dfr(boot_before[[i]], bind_rows)
  combined_boot_before[[i]] <- pBe
}

# after.
combined_boot_after <- 
  list()

for(i in seq_along(boot_after)) {
  pAf <- purrr::map_dfr(boot_after[[i]], bind_rows)
  combined_boot_after[[i]] <- pAf
}

# after-2.
combined_boot_after2 <- 
  list()

for(i in seq_along(boot_after2)) {
  pAf2 <- purrr::map_dfr(boot_after2[[i]], bind_rows)
  combined_boot_after2[[i]] <- pAf2
}

Generic occurrences per cell

For each subsampled dataset in both the Before and After lists, we count the number of occurrences of each genus per cell. This is done for all dataframes, which are then combined into one master dataframe.

Code
# before.
before_count_ls <- 
  purrr::map(combined_boot_before, ~ .x |> group_by(cell,accepted_name) |> summarise(occs = n(), .groups = 'drop')) |> 
  lapply(as.data.frame) |> 
  bind_rows()

# after.
after_count_ls <- 
  purrr::map(combined_boot_after, ~ .x |> group_by(cell,accepted_name) |> summarise(occs = n(), .groups = 'drop')) |> 
  lapply(as.data.frame) |> 
  bind_rows()

#after.2
after_count_ls2 <- 
  purrr::map(combined_boot_after2, ~ .x |> group_by(cell,accepted_name) |> summarise(occs = n(), .groups = 'drop')) |> 
  lapply(as.data.frame) |> 
  bind_rows()

Unique cell pairs

Code
# before & after.
cells_distinct_before <- 
  tibble(unique(before_count_ls$cell)) |> setNames(nm = "x") |> 
  mutate(n_part = as.numeric(sub("F", "", x))) |> 
  arrange(n_part) |> 
  pull(x)

cells_distinct_after <- 
  tibble(unique(after_count_ls$cell)) |> setNames(nm = "x") |> 
  mutate(n_part = as.numeric(sub("F", "", x))) |> 
  arrange(n_part) |> 
  pull(x)

cells_distinct_after2 <- 
  tibble(unique(after_count_ls2$cell)) |> setNames(nm = "x") |> 
  mutate(n_part = as.numeric(sub("F", "", x))) |> 
  arrange(n_part) |> 
  pull(x)

# Distinct cell pairs.
cells_distinct_pair_before <-
  expand.grid(cells_distinct_before,cells_distinct_before,stringsAsFactors = F) |> 
  setNames(nm = c("x","y")) |> 
  filter(x<y) |> 
  as_tibble() 

cells_distinct_pair_after <-
  expand.grid(cells_distinct_after,cells_distinct_after,stringsAsFactors = F) |>
  setNames(nm = c("x","y")) |> 
  filter(x<y) |> 
  as_tibble() 

cells_distinct_pair_after2 <-
  expand.grid(cells_distinct_after2,cells_distinct_after2,stringsAsFactors = F) |>
  setNames(nm = c("x","y")) |> 
  filter(x<y) |> 
  as_tibble() 

Jaccard indices

The Jaccard similarity equation, following Miller et al., 2009:

J(Cell X, Cell Y) = \frac{|Cell X \cap Cell Y|}{|Cell X \cup Cell Y|}

Code
# before.
before_jaccard <- 
  jaccard_similarity(combined_boot_before)

# after.
after_jaccard <- 
  jaccard_similarity(combined_boot_after)

# after2.
after_jaccard2 <- 
  jaccard_similarity(combined_boot_after2)
Code
# Average similarity for each cell-pair and stage.

# before.
ave_before_jaccard <- 
  bind_rows(before_jaccard) |>
  group_by(cell_x, cell_y) |> 
  summarise(avg_jaccard = mean(jaccard_similarity)) |> 
  rename("x.cell" = "cell_x", "y.cell" = "cell_y")

# after
ave_after_jaccard <- 
  bind_rows(after_jaccard) |>
  group_by(cell_x, cell_y) |> 
  summarise(avg_jaccard = mean(jaccard_similarity)) |> 
  rename("x.cell" = "cell_x", "y.cell" = "cell_y")

# after2
ave_after_jaccard2 <- 
  bind_rows(after_jaccard2) |>
  group_by(cell_x, cell_y) |> 
  summarise(avg_jaccard = mean(jaccard_similarity)) |> 
  rename("x.cell" = "cell_x", "y.cell" = "cell_y")

Great circle distance

Code
# colnames(pbdb)[23] <- "lat"
# colnames(pbdb)[22] <- "long" 

#before
before_res_matrix <- cells_distinct_pair_before |> 
  # X-coordinates
  left_join(pbdb |> select(cell,long,lat), by = c("x" = "cell"),relationship = "many-to-many") |> 
  distinct(x,y,long,lat) |> 
  rename("x_long" = "long","x_lat" = "lat") |> 
  # Y-coordinates
  left_join(pbdb |> select(cell,long,lat), by = c("y" = "cell"),relationship = "many-to-many") |> 
  distinct(x,y,x_long,x_lat,long,lat) |> 
  # Rename variables.
  rename("y_long" = "long","y_lat" = "lat") |> 
  # GCD.
  mutate(gcd = gcd.slc(long1 = x_long,lat1 = x_lat,long2 = y_long,lat2 = y_lat)) |> 
  as.data.frame()

# after.
after_res_matrix <- cells_distinct_pair_after |> 
  # X-coordinates
  left_join(pbdb |> select(cell,long,lat), by = c("x" = "cell"),relationship = "many-to-many") |> 
  distinct(x,y,long,lat) |> 
  rename("x_long" = "long","x_lat" = "lat") |> 
  # Y-coordinates
  left_join(pbdb |> select(cell,long,lat), by = c("y" = "cell"),relationship = "many-to-many") |> 
  distinct(x,y,x_long,x_lat,long,lat) |> 
  # Rename variables.
  rename("y_long" = "long","y_lat" = "lat") |> 
  # GCD.
  mutate(gcd = gcd.slc(long1 = x_long,lat1 = x_lat,long2 = y_long,lat2 = y_lat)) |> 
  as.data.frame()

# after2.
after_res_matrix2 <- cells_distinct_pair_after2 |> 
  # X-coordinates
  left_join(pbdb |> select(cell,long,lat), by = c("x" = "cell"),relationship = "many-to-many") |> 
  distinct(x,y,long,lat) |> 
  rename("x_long" = "long","x_lat" = "lat") |> 
  # Y-coordinates
  left_join(pbdb |> select(cell,long,lat), by = c("y" = "cell"),relationship = "many-to-many") |> 
  distinct(x,y,x_long,x_lat,long,lat) |> 
  # Rename variables.
  rename("y_long" = "long","y_lat" = "lat") |> 
  # GCD.
  mutate(gcd = gcd.slc(long1 = x_long,lat1 = x_lat,long2 = y_long,lat2 = y_lat)) |> 
  as.data.frame()

Czekanowski indices

Czekanowski equation following Miller et al., 2009

Czekanowski = \frac{2 \sum \min(x_{1k}, x_{2k})}{\sum x_{1k} + \sum x_{2k}}

The occurrences of a given taxon in distinct cells are evaluated against each other.

Code
# before.
before_combs <- gridComb(x = before_pbdb,cell = cell, accepted_name = accepted_name) # 13*221 = 2873 rows.

# after.
after_combs <- gridComb(x = after_pbdb,cell = cell, accepted_name = accepted_name) # 20*93 = 1860 rows. 190 unique cell pairs (check!)

after_combs2 <- gridComb(x = after_pbdb2,cell = cell, accepted_name = accepted_name) # 20*93 = 1860 rows. 190 unique cell pairs (check!)

# Next, count the occurrences of genera per unique cell. This will also include genera with no occurrences in a given cell (i.e., 0).
# These are subsequently removed in the next step.

countGen <- function(combinations, age_lists) {
  purrr::map(seq_along(age_lists), function(i) {
    name_counts <- 
      combinations |> 
      left_join(age_lists[[i]] |>  group_by(cell, accepted_name) |> count(), by = c("cell", "accepted_name")) |> 
      # Replace NA with 0.
      replace_na(list(n = 0))
    return(name_counts)
  })
}

# before.
before_genCell <- countGen(combinations = before_combs, age_lists = combined_boot_before)
# after.
after_genCell <- countGen(combinations = after_combs, age_lists = combined_boot_after)
after_genCell2 <- countGen(combinations = after_combs2, age_lists = combined_boot_after2)

# Create two identical count dataframes for each pair to join against.

# before.
before_count_lsX <- purrr::map(before_genCell, ~ .x |> rename("cell_x" = "cell", "count_cell_x" = "n"))
before_count_lsY <- purrr::map(before_genCell, ~ .x |> rename("cell_y" = "cell", "count_cell_y" = "n"))
# after.
after_count_lsX <- purrr::map(after_genCell, ~ .x |> rename("cell_x" = "cell", "count_cell_x" = "n"))
after_count_lsY <- purrr::map(after_genCell, ~ .x |> rename("cell_y" = "cell", "count_cell_y" = "n"))
# after.2
after_count_lsX2 <- purrr::map(after_genCell2, ~ .x |> rename("cell_x" = "cell", "count_cell_x" = "n"))
after_count_lsY2<- purrr::map(after_genCell2, ~ .x |> rename("cell_y" = "cell", "count_cell_y" = "n"))


set.seed(5)

# Merge counts for each cell pair.
 mCount <- function(cell_pairs, X, Y) {
  
  purrr::map(1:99, function(i) {
    # Rename the fields so that it matches.
    oG <- 
      cell_pairs |> rename("cell_x" = "x", "cell_y" = "y") |> 
      # First join (x)
      left_join(X[[i]], by = "cell_x", relationship = "many-to-many") |> 
      # Second join (y) 
      left_join(Y[[i]], by = c("cell_y", "accepted_name")) |> 
      select("cell_x", "cell_y", "accepted_name", "count_cell_x", "count_cell_y")
    
    return(oG)
  })
}

# before.
before_joined <- mCount(cell_pairs = cells_distinct_pair_before,X = before_count_lsX, Y = before_count_lsY)

# after.
after_joined <- mCount(cell_pairs = cells_distinct_pair_after,X = after_count_lsX, Y = after_count_lsY)

# after.2
after_joined2 <- mCount(cell_pairs = cells_distinct_pair_after2,X = after_count_lsX2, Y = after_count_lsY2)

# We then split based on distinct cell pairs. This creates a nested list with one split per cell pair within each of the 99 dataframes. We also remove any genus (i.e., accepted_name) with 0 occurrences recorded in both cells of a pair.
# This step also adds a new field (minimum), which is the lower of the two occurrence counts of a particular taxon between the two cells.

czekanowski_splits <- function(joined_lists) {
  
  purrr::map(1:99, function(i) {
  oP <- joined_lists[[i]] |>
    # Remove
    filter(!(count_cell_x == 0 & count_cell_y == 0)) |> 
    # Compute the minimum value between cell x and cell y (use count variable)
    mutate(minimum = pmin(count_cell_x, count_cell_y)) |> 
    group_by(cell_x, cell_y) |>  
    group_split()
  
  return(oP)
  })
}

cz_before_prep <- czekanowski_splits(before_joined)
cz_after_prep <- czekanowski_splits(after_joined)
cz_after_prep2 <- czekanowski_splits(after_joined2)

# Compute the czekanowski index.
before_czekanowski <- vector(mode = "list")
for(i in seq_along(cz_before_prep)) {
  cz <- lapply(cz_before_prep[[i]], czekanowski_similarity)
  before_czekanowski[[i]] <- cz
}

after_czekanowski <- vector(mode = "list")
for(i in seq_along(cz_after_prep)) {
  cz <- lapply(cz_after_prep[[i]], czekanowski_similarity)
  after_czekanowski[[i]] <- cz
}

after_czekanowski2 <- vector(mode = "list")
for(i in seq_along(cz_after_prep2)) {
  cz <- lapply(cz_after_prep2[[i]], czekanowski_similarity)
  after_czekanowski2[[i]] <- cz
}

# Cell pairs.
pairs_before <- do.call("rbind",lapply(cz_before_prep[[1]], function(x) x[1:2][1,]))

# Find all pairs in the After
pairs_after <- vector(mode = "list")
pairs_after2 <- vector(mode = "list")

for(i in 1:99) {
  append_cells <- do.call("rbind",lapply(cz_after_prep[[i]], function(x) x[1:2][1,]))
  pairs_after[[i]] <- append_cells
}
for(i in 1:99) {
  append_cells <- do.call("rbind",lapply(cz_after_prep2[[i]], function(x) x[1:2][1,]))
  pairs_after2[[i]] <- append_cells
}

# Reformat 
before_cz_results <- 
  purrr::map(before_czekanowski, ~as.data.frame(unlist(.x)) |> 
               rename("cz" = 1) |>
               cbind(pairs_before) |> 
               relocate(.after = "cell_y","cz") |> 
               rename("x.cell" = "cell_x", "y.cell" = "cell_y") )

after_cz_results <- 
  purrr::map(after_czekanowski, ~as.data.frame(unlist(.x)) |> 
               rename("cz" = 1))

after_cz_results2 <- 
  purrr::map(after_czekanowski2, ~as.data.frame(unlist(.x)) |> 
               rename("cz" = 1))

# Now bind the cell pairs to the After datasets.
after_cz_results <- mapply(function(x, y) cbind(y, x), after_cz_results, pairs_after, SIMPLIFY = FALSE)
after_cz_results2 <- mapply(function(x, y) cbind(y, x), after_cz_results2, pairs_after2, SIMPLIFY = FALSE)

# Stack all replicates; the average czekanowski per cell pair is computed below.
before_czekanowski_dataframe <- bind_rows(before_cz_results) 
after_czekanowski_dataframe <- bind_rows(after_cz_results)
after_czekanowski_dataframe2 <- bind_rows(after_cz_results2)
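As a sanity check on the alignment step above, here is a minimal standalone sketch (toy values, hypothetical cell names) of how `mapply(..., SIMPLIFY = FALSE)` binds each replicate's similarity values to its cell-pair labels element-wise:

```r
# Two bootstrap replicates of similarity values and their cell-pair labels.
vals  <- list(data.frame(cz = c(0.2, 0.5)),
              data.frame(cz = c(0.7, 0.1)))
pairs <- list(data.frame(cell_x = c("a", "b"), cell_y = c("c", "d")),
              data.frame(cell_x = c("a", "b"), cell_y = c("c", "d")))

# Element-wise cbind: labels first, then the similarity column.
out <- mapply(function(x, y) cbind(y, x), vals, pairs, SIMPLIFY = FALSE)
names(out[[1]]) # "cell_x" "cell_y" "cz"
```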

Results

Code
# before
before_res_matrix <- 
  before_res_matrix |> 
  rename("x.cell" = "x","y.cell" = "y") |> 
  left_join(ave_before_jaccard,by = c("x.cell","y.cell"))

# after
after_res_matrix <- 
  after_res_matrix |> 
  rename("x.cell" = "x","y.cell" = "y") |> 
  left_join(ave_after_jaccard,by = c("x.cell","y.cell"))

# after2
after_res_matrix2 <- 
  after_res_matrix2 |> 
  rename("x.cell" = "x","y.cell" = "y") |> 
  left_join(ave_after_jaccard2,by = c("x.cell","y.cell"))

# Bin by distance between cells (GCD in km).
before_res_matrix$cutdist <- 
  cut(before_res_matrix$gcd,
      breaks = c(0, 2000, 4000, 6000, 8000, 10000, 12000, 
                 14000, 16000, 18000, 20000), 
      labels = c("0", "2000", "4000", "6000","8000", 
                 "10000", "12000","14000", "16000", "18000"),
                       include.lowest = TRUE)

after_res_matrix$cutdist <- 
  cut(after_res_matrix$gcd,
      breaks = c(0, 2000, 4000, 6000, 8000, 10000, 12000, 
                 14000, 16000, 18000, 20000), 
      labels = c("0", "2000", "4000", "6000","8000", 
                 "10000", "12000","14000", "16000", "18000"),
                       include.lowest = TRUE)

after_res_matrix2$cutdist <- 
  cut(after_res_matrix2$gcd,
      breaks = c(0, 2000, 4000, 6000, 8000, 10000, 12000, 
                 14000, 16000, 18000, 20000), 
      labels = c("0", "2000", "4000", "6000","8000", 
                 "10000", "12000","14000", "16000", "18000"),
                       include.lowest = TRUE)

# Average and sd for Before.
sumRes_01 <-
  before_res_matrix |> 
  group_by(cutdist) |> 
  summarise(
    # Jaccard
    avg =  mean(avg_jaccard, na.rm = TRUE),
    sdev = sd(avg_jaccard, na.rm = TRUE),
    n = n(),
    se = sdev/sqrt(n),
    # Quantiles
    first = quantile(avg_jaccard,probs=0.25, na.rm= TRUE),
    second = quantile(avg_jaccard,probs=0.975, na.rm = TRUE)
  ) |> 
  mutate(label = 'Before',label = as.factor(label)) |> 
  mutate(cutdist = cutdist,
         cutdist = factor(cutdist,levels = c("0","2000","4000","6000","8000","10000","12000","14000","16000","18000","20000"))) |> 
  mutate(ci = se * qt(.975, n - 1), ci = as.numeric(ci)) |> 
  as.data.frame() |> suppressWarnings() # This was added to ignore the last observation.

# Average and sd for the After.
sumRes_02 <- 
  after_res_matrix |> 
  group_by(cutdist) |> 
  summarise(
    # Jaccard
    avg =  mean(avg_jaccard, na.rm = TRUE),
    sdev = sd(avg_jaccard, na.rm = TRUE),
    n = n(),
    se = sdev/sqrt(n),
    # Quantiles
    first = quantile(avg_jaccard, probs = 0.25, na.rm = TRUE),
    second = quantile(avg_jaccard, probs = 0.975, na.rm = TRUE)
  ) |> 
  mutate(label = 'After',label = as.factor(label)) |> 
  mutate(cutdist = cutdist,
         cutdist = factor(cutdist,levels = c("0","2000","4000","6000","8000","10000","12000","14000","16000","18000","20000"))) |> 
  mutate(ci = se * qt(.975, n - 1), ci = as.numeric(ci)) |> 
  as.data.frame() |> suppressWarnings() # This was added to ignore the last observation.

# Average and sd for the After 2.
sumRes_025 <- 
  after_res_matrix2 |> 
  group_by(cutdist) |> 
  summarise(
    # Jaccard
    avg =  mean(avg_jaccard, na.rm = TRUE),
    sdev = sd(avg_jaccard, na.rm = TRUE),
    n = n(),
    se = sdev/sqrt(n),
    # Quantiles
    first = quantile(avg_jaccard, probs = 0.25, na.rm = TRUE),
    second = quantile(avg_jaccard, probs = 0.95, na.rm = TRUE)
  ) |> 
  mutate(label = 'After-2',label = as.factor(label)) |> 
  mutate(cutdist = cutdist,
         cutdist = factor(cutdist,levels = c("0","2000","4000","6000","8000","10000","12000","14000","16000","18000","20000"))) |> 
  mutate(ci = se * qt(.95, n - 1), ci = as.numeric(ci)) |> 
  as.data.frame() |> suppressWarnings() # This was added to ignore the last observation.

# Combine the two.
sumRes_03 <- bind_rows(sumRes_01,sumRes_02, sumRes_025)


# Plot.
sumRes_03 |> 
   arrange(label) |> 
  mutate(label = factor(label, levels=c("Before", "After", "After-2"))) |> # reorder the intervals
   ggplot(aes(x = cutdist, y = avg, group = label, colour = label, fill = label)) +
  geom_errorbar(aes(ymin = second, ymax = first), width = 0.05, linewidth = 1, position = position_dodge(width = 0.3)) + 
  geom_line(linewidth = 1.2, position = position_dodge(width = 0.3)) +
  ylim(0, 0.7) +
  scale_size_continuous(breaks = c(5, 10, 15, 20, 25, 30)) +
  geom_point(aes(size = n), shape = 21, fill = "white", stroke = 2, position = position_dodge(width = 0.3)) +
  scale_fill_manual(values =  c("black", "darkorange4","goldenrod3")) + 
  scale_color_manual(values = c("black","darkorange4","goldenrod3")) + 
  labs(x = "Great Circle Distance (km)", 
       y = "Similarity Value",
       title = "Jaccard",
       subtitle = "Subsampling by cells and occurrences", colour = "Stages", size = "Cell-pair comparison") +
  theme_classic() +
  theme(plot.title = element_text(face = "bold"),
        axis.title = element_text(face = "bold"),
        legend.title = element_text(face = "bold"),
        aspect.ratio = 1)
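The distance binning above labels each cell pair by the lower edge of its 2,000 km class. A standalone sketch, using hypothetical coordinates and the same spherical law of cosines as the `gcd.slc()` helper defined earlier:

```r
# Great-circle distance between two hypothetical points:
# (lon 0, lat 10) and (lon 60, lat -20).
R <- 6371 # Earth mean radius (km)
deg2rad <- function(d) d * pi / 180
gcd <- acos(sin(deg2rad(10)) * sin(deg2rad(-20)) +
              cos(deg2rad(10)) * cos(deg2rad(-20)) * cos(deg2rad(60) - deg2rad(0))) * R
round(gcd) # ~7363 km

# cut() places the pair in the (6000, 8000] class, labeled by its lower edge.
cutdist <- cut(gcd, breaks = seq(0, 20000, by = 2000),
               labels = seq(0, 18000, by = 2000), include.lowest = TRUE)
as.character(cutdist) # "6000"
```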

Code
# before
before_res_matrix <-
  before_res_matrix |>
  left_join(
    before_czekanowski_dataframe |> #rename("x.cell" = "cell_x", "y.cell" = "cell_y") |>
      group_by(x.cell,y.cell) |>
      summarise(avg_cz =  mean(cz, na.rm = TRUE)), by = c("x.cell","y.cell")) |>
  relocate(.after = "avg_jaccard","avg_cz")
`summarise()` has grouped output by 'x.cell'. You can override using the
`.groups` argument.
Code
# after
after_res_matrix <-
  after_res_matrix |>
  left_join(
    after_czekanowski_dataframe |> rename("x.cell" = "cell_x", "y.cell" = "cell_y") |> 
      group_by(x.cell,y.cell) |>
      summarise(avg_cz =  mean(cz, na.rm = TRUE)), by = c("x.cell","y.cell")) |>
  relocate(.after = "avg_jaccard","avg_cz")
`summarise()` has grouped output by 'x.cell'. You can override using the
`.groups` argument.
Code
# after2
after_res_matrix2 <-
  after_res_matrix2 |>
  left_join(
    after_czekanowski_dataframe2 |> rename("x.cell" = "cell_x", "y.cell" = "cell_y") |> 
      group_by(x.cell,y.cell) |>
      summarise(avg_cz =  mean(cz, na.rm = TRUE)), by = c("x.cell","y.cell")) |>
  relocate(.after = "avg_jaccard","avg_cz")
`summarise()` has grouped output by 'x.cell'. You can override using the
`.groups` argument.
Code
# Average and standard deviation for the After-2.
sumRes_045 <-
  after_res_matrix2 |>
  group_by(cutdist) |>
  summarise(
    avg =  mean(avg_cz, na.rm = TRUE),
    sdev = sd(avg_cz, na.rm = TRUE),
    n = n(),
    se = sdev/sqrt(n),
        # Quantiles
    first = quantile(avg_cz,probs=0.25, na.rm=TRUE),
    second = quantile(avg_cz,probs=0.95, na.rm=TRUE)
  ) |>
  mutate(label = 'After-2',label = as.factor(label)) |>
  mutate(cutdist = cutdist,
         cutdist = factor(cutdist,levels = c("0","2000","4000","6000","8000","10000","12000","14000","16000","18000","20000"))) |>
  mutate(ci = se * qt(.975, n - 1), ci = as.numeric(ci)) |>
  as.data.frame() |> suppressWarnings() # This was added to ignore the last observation.

# Average and standard deviation for Before.
sumRes_04 <-
  before_res_matrix |>
  group_by(cutdist) |>
  summarise(
    avg =  mean(avg_cz, na.rm = TRUE),
    sdev = sd(avg_cz, na.rm = TRUE),
    n = n(),
    se = sdev/sqrt(n),
        # Quantiles
    first = quantile(avg_cz,probs=0.25, na.rm=TRUE),
    second = quantile(avg_cz,probs=0.95, na.rm=TRUE)
  ) |>
  mutate(label = 'Before',label = as.factor(label)) |>
  mutate(cutdist = cutdist,
         cutdist = factor(cutdist,levels = c("0","2000","4000","6000","8000","10000","12000","14000","16000","18000","20000"))) |>
  mutate(ci = se * qt(.975, n - 1), ci = as.numeric(ci)) |>
  as.data.frame() |> suppressWarnings() # This was added to ignore the last observation.

sumRes_05 <-
  after_res_matrix |>
  group_by(cutdist) |>
  summarise(
    avg =  mean(avg_cz, na.rm = TRUE),
    sdev = sd(avg_cz, na.rm = TRUE),
    n = n(),
    se = sdev/sqrt(n),
    # Quantiles
    first = quantile(avg_cz,probs=0.25, na.rm=TRUE),
    second = quantile(avg_cz,probs=0.95, na.rm=TRUE)
  ) |>
  mutate(label = 'After',label = as.factor(label)) |>
  mutate(cutdist = cutdist,
         cutdist = factor(cutdist,levels = c("0","2000","4000","6000","8000","10000","12000","14000","16000","18000","20000"))) |>
  mutate(ci = se * qt(.975, n - 1), ci = as.numeric(ci)) |>
  as.data.frame() |> suppressWarnings()

sumRes_06 <- bind_rows(sumRes_05, sumRes_04, sumRes_045)

# Plot.
sumRes_06 |> 
   arrange(label) |> 
  mutate(label = factor(label, levels=c("Before", "After", "After-2"))) |> # reorder the intervals
  ggplot(aes(x = cutdist, y = avg, group = label, colour = label, fill = label)) +
  geom_errorbar(aes(ymin = second, ymax = first), width = 0.05, linewidth = 1, position = position_dodge(width = 0.3)) + 
  geom_line(linewidth = 1.2, position = position_dodge(width = 0.3)) +
  ylim(0, 0.7) +
  scale_size_continuous(breaks = c(5, 10, 15, 20, 25, 30)) +
  geom_point(aes(size = n), shape = 21, fill = "white", stroke = 2, position = position_dodge(width = 0.3)) +
  scale_fill_manual(values = c("black","darkorange4", "goldenrod3")) + 
  scale_color_manual(values = c("black", "darkorange4","goldenrod3")) + 
  labs(x = "Great Circle Distance (km)", 
       y = "Similarity Value", 
       title = "Czekanowski", 
       subtitle = "Subsampling by cells and occurrences", colour = "Stages", size = "Cell-pair comparison") +
  theme_classic() +
  theme(plot.title = element_text(face = "bold"), 
        axis.title = element_text(face = "bold"), 
        legend.title = element_text(face = "bold"), 
        aspect.ratio = 1)
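Both plots draw their intervals from the same t-based formula, `ci = se * qt(.975, n - 1)`; a quick standalone check with hypothetical bin values:

```r
# Hypothetical distance bin: 30 cell pairs with standard deviation 0.12.
n  <- 30
se <- 0.12 / sqrt(n)
ci <- se * qt(0.975, n - 1) # half-width of a 95% confidence interval
round(ci, 4) # ~0.0448
```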

Sensitivity analysis

Similarity measurements by survival status.

Code
# Retain only taxa with 15 or more occurrences.
pbdb_sensitivity <- 
  pbdb |> 
  filter(occs >= 15) # 3578 observations.

# Split survival datasets by unique cell id.

# First pulse:

# after survivors
sAft <- pbdb_sensitivity |> 
  filter(early_interval == "Hirnantian" & ex==0 & fad >=443.8) |>
  group_split(cell_id) |> lapply(as.data.frame)

# Before victims.
sV <- pbdb_sensitivity |>
  filter(early_interval == "Katian" & ex==1) |>
  group_split(cell_id) |> lapply(as.data.frame)

# Before survivors.
sBef <- pbdb_sensitivity |> 
  filter(early_interval  == "Katian" & ex==0) |>
  group_split(cell_id) |> lapply(as.data.frame)

## Second pulse:

# Survivors After:
sAftHir <- pbdb_sensitivity |> 
  filter(early_interval == "Rhuddanian" & ex==0 & fad >=440.8) |>
  group_split(cell_id) |> lapply(as.data.frame)

# Victims of Hirnantian pulse.
sVHir <- pbdb_sensitivity |>
  filter(early_interval == "Hirnantian" & ex==1) |>
  group_split(cell_id) |> lapply(as.data.frame)

# Survivors of Hirnantian pulse.
sBefHir <- pbdb_sensitivity |> 
  filter(early_interval  == "Hirnantian" & ex==0) |>
  group_split(cell_id) |> lapply(as.data.frame)


# Subsampling.
subsampling_fun <-
  function(x, n_boot = 99, sample_size = 12, seed = 5) {
  set.seed(seed)
  # Samples.
  boot_samples <- purrr::map(1:n_boot, ~ sample(x, sample_size, replace = FALSE))
  # Combine cells into single dataframes.
  comb_samples <- purrr::map(boot_samples, ~ map_dfr(.x, bind_rows))
  
  return(comb_samples)
}


# Subsampled data.
sBef_boot <- subsampling_fun(sBef)
sV_boot <- subsampling_fun(sV)
sAft_boot <- subsampling_fun(sAft)


# Second pulse.
sBef_boot_Hir <- subsampling_fun(sBefHir)
sV_boot_Hir <- subsampling_fun(sVHir)
sAft_boot_Hir <- subsampling_fun(sAftHir)
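The split-and-subsample pipeline above can be mimicked in base R as a quick check; this sketch (toy data, hypothetical column values) mirrors `group_split()` followed by `subsampling_fun()` with 5 replicates of 12 cells:

```r
# Toy occurrences: 20 cells with two genus records each.
toy <- data.frame(cell_id = rep(1:20, each = 2),
                  accepted_name = paste0("genus_", 1:40))

# One data frame per cell (base-R analogue of dplyr::group_split()).
cells <- split(toy, toy$cell_id)

# Five replicates of 12 cells sampled without replacement, as in subsampling_fun().
set.seed(5)
boot <- lapply(1:5, function(r) do.call(rbind, sample(cells, 12, replace = FALSE)))

length(boot)    # 5 replicates
nrow(boot[[1]]) # 24 rows: 12 cells x 2 occurrences each
```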

Jaccard index calculation

Code
# After survivors.
after_survivors_jaccard <- jaccard_similarity(sAft_boot)
# Before victims.
before_victims_jaccard <- jaccard_similarity(sV_boot)
# Before survivors.
before_survivors_jaccard <- jaccard_similarity(sBef_boot)


#Second pulse:

# After survivors.
after_survivors_jaccard_Hir <- jaccard_similarity(sAft_boot_Hir)
# Before victims.
before_victims_jaccard_Hir <- jaccard_similarity(sV_boot_Hir)
# Before survivors.
before_survivors_jaccard_Hir <- jaccard_similarity(sBef_boot_Hir)
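The `jaccard_similarity()` helper works on presence/absence: for any two cells the index is the size of the shared genus set over the size of the combined set. A toy check (genus names illustrative):

```r
# Genus lists observed in two hypothetical cells.
cell_x <- c("Hirnantia", "Dalmanella", "Eostropheodonta")
cell_y <- c("Hirnantia", "Dalmanella", "Plectothyrella", "Cliftonia")

jaccard <- length(intersect(cell_x, cell_y)) / length(union(cell_x, cell_y))
jaccard # 2 shared / 5 total = 0.4
```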

Averages

Code
##First pulse:

# Mean jaccard for the After survivors.
aAftsJ <- 
  bind_rows(after_survivors_jaccard) |>
  group_by(cell_x, cell_y) |> 
  summarise(avg_jaccard = mean(jaccard_similarity)) |> 
  rename("x.cell" = "cell_x", "y.cell" = "cell_y")
`summarise()` has grouped output by 'cell_x'. You can override using the
`.groups` argument.
Code
# Mean jaccard for the Before victims.
avJ <- 
  bind_rows(before_victims_jaccard) |>
  group_by(cell_x, cell_y) |> 
  summarise(avg_jaccard = mean(jaccard_similarity)) |> 
  rename("x.cell" = "cell_x", "y.cell" = "cell_y")
`summarise()` has grouped output by 'cell_x'. You can override using the
`.groups` argument.
Code
# before survivors.
aBefsJ <- 
  bind_rows(before_survivors_jaccard) |>
  group_by(cell_x, cell_y) |> 
  summarise(avg_jaccard = mean(jaccard_similarity)) |> 
  rename("x.cell" = "cell_x", "y.cell" = "cell_y")
`summarise()` has grouped output by 'cell_x'. You can override using the
`.groups` argument.
Code
## Second pulse:

# Mean jaccard for the After survivors.
aAftsJ_Hir <- 
  bind_rows(after_survivors_jaccard_Hir) |>
  group_by(cell_x, cell_y) |> 
  summarise(avg_jaccard = mean(jaccard_similarity)) |> 
  rename("x.cell" = "cell_x", "y.cell" = "cell_y")
`summarise()` has grouped output by 'cell_x'. You can override using the
`.groups` argument.
Code
# Mean jaccard for the Before victims.
avJ_Hir <- 
  bind_rows(before_victims_jaccard_Hir) |>
  group_by(cell_x, cell_y) |> 
  summarise(avg_jaccard = mean(jaccard_similarity)) |> 
  rename("x.cell" = "cell_x", "y.cell" = "cell_y")
`summarise()` has grouped output by 'cell_x'. You can override using the
`.groups` argument.
Code
# before survivors.
aBefsJ_Hir <- 
  bind_rows(before_survivors_jaccard_Hir) |>
  group_by(cell_x, cell_y) |> 
  summarise(avg_jaccard = mean(jaccard_similarity)) |> 
  rename("x.cell" = "cell_x", "y.cell" = "cell_y")
`summarise()` has grouped output by 'cell_x'. You can override using the
`.groups` argument.

Visualize results

Code
## FIRST PULSE: 

# After results: survivors.
mRes_01 <- after_res_matrix[,c(1:7,10)] |> left_join(aAftsJ,by = c("x.cell","y.cell"))
# Before results: survivors & victims.
mRes_02 <- before_res_matrix[,c(1:7,10)] |> left_join(avJ,by = c("x.cell","y.cell"))
mRes_03 <- before_res_matrix[,c(1:7,10)] |> left_join(aBefsJ,by = c("x.cell","y.cell"))

# SECOND PULSE 
mRes_04 <- after_res_matrix2[,c(1:7,10)] |>  left_join(aAftsJ_Hir,by = c("x.cell","y.cell"))
# Before results: survivors & victims.
mRes_05 <- after_res_matrix[,c(1:7,10)] |> left_join(avJ_Hir,by = c("x.cell","y.cell"))
mRes_06  <- after_res_matrix[,c(1:7,10)] |> left_join(aBefsJ_Hir,by = c("x.cell","y.cell"))

# Summary statistics for each survival category.


## FIRST PULSE 
# after survivors.
sumRes_07 <-
  mRes_01 |> 
  group_by(cutdist) |> 
  summarise(
    # Jaccard
    avg =  mean(avg_jaccard, na.rm = TRUE),
    sdev = sd(avg_jaccard, na.rm = TRUE),
    n = n(),
    se = sdev/sqrt(n),
       # Quantiles
    first = quantile(avg_jaccard,probs=0.25,na.rm=TRUE),
    second = quantile(avg_jaccard,probs=0.95, na.rm=TRUE)
  ) |> 
  mutate(label = 'After survivors',label = as.factor(label)) |> 
  mutate(cutdist = cutdist,
         cutdist = factor(cutdist,levels = c("0","2000","4000","6000","8000","10000","12000","14000","16000","18000","20000"))) |> 
  mutate(ci = se * qt(.95, n - 1), ci = as.numeric(ci)) |> 
  as.data.frame()
Code
# before victims.
sumRes_08 <-
  mRes_02 |> 
  group_by(cutdist) |> 
  summarise(
    # Jaccard
    avg =  mean(avg_jaccard, na.rm = TRUE),
    sdev = sd(avg_jaccard, na.rm = TRUE),
    n = n(),
    se = sdev/sqrt(n),
    # Quantiles
    first = quantile(avg_jaccard,probs=0.25, na.rm=TRUE),
    second = quantile(avg_jaccard,probs=0.95, na.rm=TRUE)
  ) |> 
  mutate(label = 'Before victims',label = as.factor(label)) |> 
  mutate(cutdist = cutdist,
         cutdist = factor(cutdist,levels = c("0","2000","4000","6000","8000","10000","12000","14000","16000","18000","20000"))) |> 
  mutate(ci = se * qt(.95, n - 1), ci = as.numeric(ci)) |> 
  as.data.frame() |> suppressWarnings()

# before survivors.
sumRes_09 <-
  mRes_03 |> 
  group_by(cutdist) |> 
  summarise(
    # Jaccard
    avg =  mean(avg_jaccard, na.rm = TRUE),
    sdev = sd(avg_jaccard, na.rm = TRUE),
    n = n(),
    se = sdev/sqrt(n),
       # Quantiles
    first = quantile(avg_jaccard,probs=0.25, na.rm=TRUE),
    second = quantile(avg_jaccard,probs=0.95, na.rm=TRUE)
  ) |> 
  mutate(label = 'Before survivors',label = as.factor(label)) |> 
  mutate(cutdist = cutdist,
         cutdist = factor(cutdist,levels = c("0","2000","4000","6000","8000","10000","12000","14000","16000","18000","20000"))) |> 
  mutate(ci = se * qt(.95, n - 1), ci = as.numeric(ci)) |> 
  as.data.frame() |> suppressWarnings()

## SECOND PULSE 

# after survivors.
sumRes_10 <-
  mRes_04 |> 
  group_by(cutdist) |> 
  summarise(
    # Jaccard
    avg =  mean(avg_jaccard, na.rm = TRUE),
    sdev = sd(avg_jaccard, na.rm = TRUE),
    n = n(),
    se = sdev/sqrt(n),
       # Quantiles
    first = quantile(avg_jaccard,probs=0.25,na.rm=TRUE),
    second = quantile(avg_jaccard,probs=0.95, na.rm=TRUE)
  ) |> 
  mutate(label = 'After survivors',label = as.factor(label)) |> 
  mutate(cutdist = cutdist,
         cutdist = factor(cutdist,levels = c("0","2000","4000","6000","8000","10000","12000","14000","16000","18000","20000"))) |> 
  mutate(ci = se * qt(.95, n - 1), ci = as.numeric(ci)) |> 
  as.data.frame()

# before victims.
sumRes_11  <-
  mRes_05 |> 
  group_by(cutdist) |> 
  summarise(
    # Jaccard
    avg =  mean(avg_jaccard, na.rm = TRUE),
    sdev = sd(avg_jaccard, na.rm = TRUE),
    n = n(),
    se = sdev/sqrt(n),
    # Quantiles
    first = quantile(avg_jaccard,probs=0.25, na.rm=TRUE),
    second = quantile(avg_jaccard,probs=0.95, na.rm=TRUE)
  ) |> 
  mutate(label = 'Before victims',label = as.factor(label)) |> 
  mutate(cutdist = cutdist,
         cutdist = factor(cutdist,levels = c("0","2000","4000","6000","8000","10000","12000","14000","16000","18000","20000"))) |> 
  mutate(ci = se * qt(.95, n - 1), ci = as.numeric(ci)) |> 
  as.data.frame() |> suppressWarnings()

# before survivors.
sumRes_12 <-
  mRes_06 |> 
  group_by(cutdist) |> 
  summarise(
    # Jaccard
    avg =  mean(avg_jaccard, na.rm = TRUE),
    sdev = sd(avg_jaccard, na.rm = TRUE),
    n = n(),
    se = sdev/sqrt(n),
       # Quantiles
    first = quantile(avg_jaccard,probs=0.25, na.rm=TRUE),
    second = quantile(avg_jaccard,probs=0.95, na.rm=TRUE)
  ) |> 
  mutate(label = 'Before survivors',label = as.factor(label)) |> 
  mutate(cutdist = cutdist,
         cutdist = factor(cutdist,levels = c("0","2000","4000","6000","8000","10000","12000","14000","16000","18000","20000"))) |> 
  mutate(ci = se * qt(.95, n - 1), ci = as.numeric(ci)) |> 
  as.data.frame() |> suppressWarnings()

# Combine all results.
sumRes_13 <- bind_rows(sumRes_07,sumRes_08,sumRes_09) # First pulse
sumRes_14 <- bind_rows(sumRes_10,sumRes_11,sumRes_12) # Second pulse

# Plot.
sumRes_13 |> 
  arrange(label) |> 
   mutate(label = factor(label, levels=c("Before victims", "Before survivors", "After survivors"))) |> # reorder the intervals
   ggplot(aes(x = cutdist, y = avg, group = label, colour = label, fill = label)) +
  geom_errorbar(aes(ymin = second, ymax = first), width = 0.05, linewidth = 1, position = position_dodge(width = 0.3)) + 
  geom_line(linewidth = 1.2, position = position_dodge(width = 0.3)) +
  ylim(0, 0.7) +
  scale_size_continuous(breaks = c(5, 10, 15, 20, 25, 30)) +
  geom_point(aes(size = n), shape = 21, fill = "white", stroke = 2, position = position_dodge(width = 0.3)) +
  scale_fill_manual(values =  c("gray60","black","goldenrod")) + 
  scale_color_manual(values = c("gray60","black","goldenrod")) + 
  labs(x = "Great Circle Distance (km)", 
       y = "Similarity Value",
       title = "First Pulse",
       subtitle = "Jaccard by survival status", colour = "Stages", size = "Cell-pair comparison") +
  theme_classic() +
  theme(plot.title = element_text(face = "bold"),
        axis.title = element_text(face = "bold"),
        legend.title = element_text(face = "bold"),
        aspect.ratio = 1)

Code
# Plot for second pulse
sumRes_14 |> 
  arrange(label) |> 
   mutate(label = factor(label, levels=c("Before victims", "Before survivors", "After survivors"))) |> # reorder the intervals
   ggplot(aes(x = cutdist, y = avg, group = label, colour = label, fill = label)) +
  geom_errorbar(aes(ymin = second, ymax = first), width = 0.05, linewidth = 1, position = position_dodge(width = 0.3)) + 
  geom_line(linewidth = 1.2, position = position_dodge(width = 0.3)) +
  ylim(0, 0.7) +
  scale_size_continuous(breaks = c(5, 10, 15, 20, 25, 30)) +
  geom_point(aes(size = n), shape = 21, fill = "white", stroke = 2, position = position_dodge(width = 0.3)) +
 scale_fill_manual(values =  c("gray60","black","goldenrod")) + 
  scale_color_manual(values = c("gray60","black","goldenrod")) + 
  labs(x = "Great Circle Distance (km)", 
       y = "Similarity Value",
       title = "Second Pulse",
       subtitle = "Jaccard by survival status", colour = "Stages", size = "Cell-pair comparison") +
  theme_classic() +
  theme(plot.title = element_text(face = "bold"),
        axis.title = element_text(face = "bold"),
        legend.title = element_text(face = "bold"),
        aspect.ratio = 1)

Setup for the second pulse analysis:

Czekanowski index calculation

After survivors

Code
# Data table 
prep_Afts <- map(sAft_boot, ~ czekanowski_data_prep(.x, cell = "cell", accepted_name = "accepted_name"))
# Add minimum field to each data sub-list.
prep_Afts <- map(prep_Afts, ~ map(.x, ~ mutate(.x, minimum = pmin(count_cell_x, count_cell_y))))

# Compute czekanowski
cz_Afts <- vector(mode = "list")
for(i in seq_along(prep_Afts)) {
  cZ <- lapply(prep_Afts[[i]],czekanowski_similarity)
  cz_Afts[[i]] <- cZ
}

# Find all pairs.
sp_01 <- vector(mode = "list")
for(i in 1:99) {
  pA <- do.call("rbind",lapply(prep_Afts[[i]], function(x) x[1:2][1,]))
  sp_01[[i]] <- pA
}

# Convert to dataframe and unlist.
flat_Afts <- purrr::map(cz_Afts, ~as.data.frame(unlist(.x)) |> rename("cz" = 1))
# Append correctly.
append_flat_afts <- mapply(function(x, y) cbind(y, x), flat_Afts, sp_01, SIMPLIFY = FALSE)

# Compute the average czekanowski per cell-pair.
res_Afts <- bind_rows(append_flat_afts) |> rename("x.cell" = "cell_x", "y.cell" = "cell_y")

Before victims

Code
# Data table 
prep_v <- map(sV_boot, ~ czekanowski_data_prep(.x, cell = "cell", accepted_name = "accepted_name"))
# Add minimum field to each data sub-list.
prep_v <- map(prep_v, ~ map(.x, ~ mutate(.x, minimum = pmin(count_cell_x, count_cell_y))))

# Compute czekanowski
cz_v <- vector(mode = "list")
for(i in seq_along(prep_v)) {
  cZ <- lapply(prep_v[[i]],czekanowski_similarity)
  cz_v[[i]] <- cZ
}

# Find all pairs.
sp_02 <- vector(mode = "list")
for(i in 1:99) {
  pQ <- do.call("rbind",lapply(prep_v[[i]], function(x) x[1:2][1,]))
  sp_02[[i]] <- pQ
}

# Convert to dataframe and unlist.
flat_v <- purrr::map(cz_v, ~as.data.frame(unlist(.x)) |> rename("cz" = 1))
# Append correctly.
append_flat_v <- mapply(function(x, y) cbind(y, x), flat_v, sp_02, SIMPLIFY = FALSE)

# Compute the average czekanowski per cell-pair.
res_v <- bind_rows(append_flat_v) |> rename("x.cell" = "cell_x", "y.cell" = "cell_y")

Before survivors

Code
# Data table 
prep_Befs <- map(sBef_boot, ~ czekanowski_data_prep(.x, cell = "cell", accepted_name = "accepted_name"))
# Add minimum field to each data sub-list.
prep_Befs <- map(prep_Befs, ~ map(.x, ~ mutate(.x, minimum = pmin(count_cell_x, count_cell_y))))

# Compute czekanowski
cz_Befs <- vector(mode = "list")
for(i in seq_along(prep_Befs)) {
  cM <- lapply(prep_Befs[[i]],czekanowski_similarity)
  cz_Befs[[i]] <- cM
}

# Find all pairs.
sp_03 <- vector(mode = "list")
for(i in 1:99) {
  pO <- do.call("rbind",lapply(prep_Befs[[i]], function(x) x[1:2][1,]))
  sp_03[[i]] <- pO
}

# Convert to dataframe and unlist.
flat_Befs <- purrr::map(cz_Befs, ~as.data.frame(unlist(.x)) |> rename("cz" = 1))
# Append correctly.
append_flat_Befs <- mapply(function(x, y) cbind(y, x), flat_Befs, sp_03, SIMPLIFY = FALSE)

# Compute the average czekanowski per cell-pair.
res_Befs <- bind_rows(append_flat_Befs) |> rename("x.cell" = "cell_x", "y.cell" = "cell_y")
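`czekanowski_similarity()` is not reproduced in this section, but the `minimum = pmin(count_cell_x, count_cell_y)` preparation above implies the standard abundance-based form, 2·Σmin(x, y) / (Σx + Σy). Assuming that form, a toy pair of cells (hypothetical counts) gives:

```r
# Genus occurrence counts in two hypothetical cells.
x <- c(Hirnantia = 5, Dalmanella = 2, Cliftonia = 0)
y <- c(Hirnantia = 3, Dalmanella = 4, Cliftonia = 1)

cz <- 2 * sum(pmin(x, y)) / (sum(x) + sum(y))
cz # 2 * (3 + 2 + 0) / (7 + 8) = 10/15, about 0.667
```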
Code
# Averages.
ave_s1 <- res_Afts |> 
  group_by(x.cell,y.cell) |>
  summarise(avg_cz =  mean(cz, na.rm = TRUE))
`summarise()` has grouped output by 'x.cell'. You can override using the
`.groups` argument.
Code
ave_s2 <- res_v |> 
  group_by(x.cell,y.cell) |>
  summarise(avg_cz =  mean(cz, na.rm = TRUE))
`summarise()` has grouped output by 'x.cell'. You can override using the
`.groups` argument.
Code
ave_s3 <- res_Befs |> 
  group_by(x.cell,y.cell) |>
  summarise(avg_cz =  mean(cz, na.rm = TRUE))
`summarise()` has grouped output by 'x.cell'. You can override using the
`.groups` argument.

Visualize results

Code
# Reverse cell pairs first.
ave_s1_reversed <- ave_s1|> rename("x.cell" = "y.cell", "y.cell" = "x.cell")
ave_s2_reversed <- ave_s2|> rename("x.cell" = "y.cell", "y.cell" = "x.cell")
ave_s3_reversed <- ave_s3|> rename("x.cell" = "y.cell", "y.cell" = "x.cell")

# after results: survivors.
qRes_01 <- after_res_matrix[,c(1:7,10)] |>
  inner_join(ave_s1_reversed, by = c("x.cell","y.cell"))
# before results: survivors & victims.
qRes_02 <- ave_s2_reversed |> 
  left_join(before_res_matrix[,c(1:7,10)],by = c("x.cell","y.cell"))

qRes_03 <- ave_s3_reversed |> 
  left_join(before_res_matrix[,c(1:7,10)],by = c("x.cell","y.cell"))

# Summary statistics for each survival category.

# After survivors.
sumRes_15 <-
  qRes_01 |> 
  group_by(cutdist) |> 
  summarise(
    avg =  mean(avg_cz, na.rm = TRUE),
    sdev = sd(avg_cz, na.rm = TRUE),
    n = n(),
    se = sdev/sqrt(n),
    # Quantiles
    first = quantile(avg_cz, probs = 0.25, na.rm = TRUE),
    second = quantile(avg_cz, probs = 0.95, na.rm = TRUE)
  ) |> 
  mutate(label = 'after survivors',label = as.factor(label)) |> 
  mutate(cutdist = cutdist,
         cutdist = factor(cutdist,levels = c("0","2000","4000","6000","8000","10000","12000","14000","16000","18000","20000"))) |>
  mutate(ci = se * qt(.95, n - 1), ci = as.numeric(ci)) |> 
  as.data.frame()
Code
# Before victims.
sumRes_16 <-
  qRes_02 |> 
  group_by(cutdist) |> 
  summarise(
    avg =  mean(avg_cz, na.rm = TRUE),
    sdev = sd(avg_cz, na.rm = TRUE),
    n = n(),
    se = sdev/sqrt(n),
    # Quantiles
    first = quantile(avg_cz, probs = 0.25, na.rm = TRUE),
    second = quantile(avg_cz, probs = 0.95, na.rm = TRUE)
  ) |> 
  mutate(label = 'before victims',label = as.factor(label)) |> 
  mutate(cutdist = cutdist,
         cutdist = factor(cutdist,levels = c("0","2000","4000","6000","8000","10000","12000","14000","16000","18000","20000"))) |> 
  mutate(ci = se * qt(.95, n - 1), ci = as.numeric(ci)) |> 
  as.data.frame() |> suppressWarnings()

# before survivors.
sumRes_17 <-
  qRes_03 |> 
  group_by(cutdist) |> 
  summarise(
    avg =  mean(avg_cz, na.rm = TRUE),
    sdev = sd(avg_cz, na.rm = TRUE),
    n = n(),
    se = sdev/sqrt(n),
    # Quantiles
    first = quantile(avg_cz, probs = 0.25, na.rm = TRUE),
    second = quantile(avg_cz, probs = 0.95, na.rm = TRUE)
  ) |> 
  mutate(label = 'before survivors',label = as.factor(label)) |> 
  mutate(cutdist = cutdist,
         cutdist = factor(cutdist,levels = c("0","2000","4000","6000","8000","10000","12000","14000","16000","18000","20000"))) |> 
  mutate(ci = se * qt(.95, n - 1), ci = as.numeric(ci)) |> 
  as.data.frame() |> suppressWarnings()

# Combine all results.
sumRes_18 <- bind_rows(sumRes_15,sumRes_16,sumRes_17)

# Plot.
sumRes_18 |> 
 ggplot(aes(x = cutdist, y = avg, group = label, colour = label, fill = label)) +
  geom_errorbar(aes(ymin = second, ymax = first), width = 0.05, linewidth = 1, position = position_dodge(width = 0.3)) + 
  geom_line(linewidth = 1.2, position = position_dodge(width = 0.3)) +
  ylim(0, 0.7) +
  scale_size_continuous(breaks = c(5, 10, 15, 20, 25, 30)) +
  geom_point(aes(size = n), shape = 21, fill = "white", stroke = 2, position = position_dodge(width = 0.3)) +
  scale_fill_manual(values =  c("goldenrod","gray60", "black")) + 
  scale_color_manual(values = c("goldenrod","gray60", "black")) + 
  labs(x = "Great Circle Distance (km)", 
       y = "Similarity Value",
       title = "Czekanowski by survival status",
       subtitle = "Subsampling by cells and occurrences", colour = "Stages", size = "Cell-pair comparison") +
  theme_classic() +
  theme(plot.title = element_text(face = "bold"),
        axis.title = element_text(face = "bold"),
        legend.title = element_text(face = "bold"),
        aspect.ratio = 1)

Extra

Count occurrences per genus in the Hirnantian to determine whether they match known disaster taxa in the literature, such as the genus Hirnantia.

Code
# Find the top 5 genera based on number of occurrences
pbdb.hir  <- pbdb |> filter(interval.ma==443.8)

genus_counts <- pbdb.hir |>
  group_by(accepted_name) |> 
             summarise(count = n(), .groups = 'drop') |> 
             top_n(5, wt = count) |>
             arrange(desc(count))

# Convert accepted_name to a factor with levels in descending order of count

genus_counts$accepted_name <- factor(genus_counts$accepted_name, levels = genus_counts$accepted_name[order(genus_counts$count, decreasing = TRUE)])

# Visualize it with switched axes

ggplot(genus_counts, aes(x = count, y = accepted_name)) +
  theme_classic() +
  geom_bar(stat = "identity", fill = "dodgerblue3") +
  labs(x = "Number of Occurrences", y = "Genus Name") +
  coord_flip() # optional: omit coord_flip() to keep the horizontal bars
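The manual relevel-by-count step above can also be written with `stats::reorder()`; a toy check (genus names and counts hypothetical):

```r
genus_counts <- data.frame(accepted_name = c("A", "B", "C"),
                           count = c(2, 9, 5))

# Negating the counts sorts the factor levels in descending order of occurrence.
genus_counts$accepted_name <-
  with(genus_counts, reorder(factor(accepted_name), -count))

levels(genus_counts$accepted_name) # "B" "C" "A"
```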

Wastebin taxa removed

Locate and remove all wastebin taxa, following the methods of Plotnick and Wagner (2005). First, using the full database, locate the wastebin taxa across the “before” and “after” intervals by finding the most frequently occurring genera (the top 20 by occurrence count).

Code
options(download.file.method = "libcurl")
pbdb <- read.csv("https://raw.githubusercontent.com/Anonymous-paleobio/Taxonomic-Homogenization-DeepTime/refs/heads/main/Pbdb-files/pbdb_lome.csv")

pbdb <- pbdb[, -c(1, 22, 21)]


 
# Get the frequency of each genus, then keep the top 20: these are our wastebasket genera.
 
pbdb_occs <- pbdb |> 
  group_by(accepted_name) |> 
  summarise(frequency = n()) |> 
  arrange(desc(frequency)) |> 
   slice_head(n = 20) |>  # keep only the top 20 rows
  arrange(accepted_name) # order alphabetically
        

# Next, continue with the normal process of getting Jaccard similarity.

# Here is where we remove the wastebin taxa:            
pbdb <- pbdb |>    
  filter(interval.ma %in% c("445.2", "443.8", "440.8") &
       (!accepted_name %in% pbdb_occs$accepted_name)) # particularly, this line
 
# Identify invalid coordinates.
cl <- cc_val(pbdb, value = "flagged", lat = "paleolat", lon = "paleolng") # flags incorrect coordinates
Testing coordinate validity
Flagged 0 records.
Code
cl_rec <- pbdb[!cl, ] # extract and check them

pbdb <- pbdb |>
  cc_val(lat = "paleolat", lon = "paleolng") # remove them
Testing coordinate validity
Removed 0 records.

Next, follow all the steps from Visualization of Cells and Occurrences down to the Czekanowski indices again. You will notice slight changes in the number of cells, since filtering out the wastebin taxa reduces the number of occurrences.

Spatial standardization

Here, we apply the divvy package to spatially standardize our data using the circular “cookie” subsampling method, then re-run our Jaccard calculations on the standardized data.

Code
library(raster)
library(sf) # needed for st_as_sf() below
pbdb <- read.csv("https://raw.githubusercontent.com/Anonymous-paleobio/Taxonomic-Homogenization-DeepTime/refs/heads/main/Pbdb-files/pbdb_lome.csv")

pbdb <- pbdb[, -c(1, 22, 21)]

# Make sure coordinates are numerical
pbdb$paleolat <- as.numeric(pbdb$paleolat)
pbdb$paleolng <- as.numeric(pbdb$paleolng)

# Subset by age
pbdb.before <- pbdb |> filter(early_interval == "Katian")
pbdb.after <- pbdb |> filter(early_interval == "Hirnantian")
pbdb.after2 <- pbdb |> filter(early_interval == "Rhuddanian")

# Initialize an equal-earth projection and a global raster:
rWorld <- rast()
prj <- 'EPSG:8857'

# Match the divvy resolution with our hexagonal grid resolution from the icosa package.
deg <- 4.5 # hexagonal grid resolution in degrees

# Approximate meters per degree at the equator
meters_per_deg <- 111320  

# Adjust for the latitude of area
lat_center <- mean(raster::yFromCell(rWorld, 1:ncell(rWorld)))  # rough center latitude
meters_res <- deg * meters_per_deg * cos(lat_center * pi/180)
meters_res
[1] 500940
Code
# Get the new resolution
new_res <- meters_res


# New_res is the target resolution in meters
rPrj <- project(rWorld, prj, method = "bilinear", res = new_res)


terra::values(rPrj) <- 1:ncell(rPrj)

# Coordinate column names for the current and target coordinate reference system
xyCartes <- c('paleolng','paleolat')
xyCell   <- c('cellX','cellY')
Code
llOccs <- vect(pbdb.before, geom = xyCartes, crs = 'epsg:4326')
prjOccs <- terra::project(llOccs, prj)
pbdb.before$cell <- cells(rPrj, prjOccs)[,'cell']
pbdb.before[, xyCell] <- xyFromCell(rPrj, 
                                pbdb.before$cell)

sdSumry(pbdb.before, taxVar = 'accepted_name', xy = xyCell, crs = prj) |> 
  print()
     nOcc nLoc centroidX centroidY latRange greatCircDist meanPairDist
[1,] 3892  117  -4799668  -1670656  113.169      24627.71     8569.577
     minSpanTree     SCOR nTax
[1,]    90332.97 35.32539  813
Code
# Disregard SCOR 

occUniq <- uniqify(pbdb.before, xyCell)
ptsUniq <- st_as_sf(occUniq, coords = xyCell, crs = prj)


worldP <- ggplot(data = plates.lome$geometry) +
  theme_bw() +
  geom_sf() +
  geom_sf(data = ptsUniq, shape = 16, color = 'red3')

circLocs <- cookies(dat = pbdb.before,
                    xy = xyCell,
                    iter = 100,
                    nSite = 5,
                    r = 2000,
                    weight = TRUE,
                    crs = prj,
                    output = "full")


# Convert one circular sample to sf
smplPts <- st_as_sf(circLocs[[1]], coords = xyCell, crs = prj)

# Convert 2000 km to m
cntr <- smplPts[1, ]
r_km <- 2000
buf <- st_buffer(cntr, dist = r_km * 1000)

# Match world map to projection
world_sf <- st_as_sf(plates.lome) |> st_transform(crs = prj)


# circLocs is the list of circular subsamples (100 iterations)
# We'll just plot the first 3 for clarity
n_plot <- 3
set.seed(567)
subsamples <- circLocs[1:n_plot]


# Create a data frame of points and buffers for all subsamples
all_points <- list()
all_buffers <- list()

for (i in seq_along(subsamples)) {
  smplPts <- st_as_sf(subsamples[[i]], coords = xyCell, crs = prj)
  
  # Pick the first point as center
  cntr <- smplPts[1, ]
  buf <- st_buffer(cntr, dist = 2000 * 1000)  # 2000 km in meters
  
  smplPts$iteration <- paste0("#", i)
  buf$iteration <- paste0("#", i)
  
  all_points[[i]] <- smplPts
  all_buffers[[i]] <- buf
}

# Combine all iterations
all_points_sf <- do.call(rbind, all_points)
all_buffers_sf <- do.call(rbind, all_buffers)



# Plot
ggplot() +
  geom_sf(data = world_sf, fill = "gray80", color = "white") +
    geom_sf(data = ptsUniq, shape = 16, color = 'black')+
  geom_sf(data = all_buffers_sf, fill = NA, linewidth = 1, color = "navyblue") +
  geom_sf(
    data = all_points_sf,
    aes(color = iteration),
    shape = 16,
    size = 3
  ) +
  theme_bw() +
  labs(
    title = "Circular subsamples (first 3 iterations)",
    subtitle="Katian",
    x = NULL, y = NULL,
    color = "Cells in Iteration"
  )
Warning in rep(pch, length.out = length(x)): 'x' is NULL so the result will be
NULL

circLocs now holds 100 subsampled occurrence data frames, one per iteration, which can be inspected with str(circLocs[[1]]):

Code
str(circLocs[[1]])
'data.frame':   1891 obs. of  23 variables:
 $ cell          : num  1186 1186 1186 1186 1186 ...
 $ cell_id       : int  26 26 26 26 26 26 26 26 26 26 ...
 $ occs          : int  487 487 487 487 487 487 487 487 487 487 ...
 $ interval.ma   : num  445 445 445 445 445 ...
 $ early_interval: chr  "Katian" "Katian" "Katian" "Katian" ...
 $ fad           : num  453 458 445 453 453 ...
 $ lad           : num  445 445 267 383 252 ...
 $ accepted_name : chr  "Anazyga" "Asaphus" "Clinopistha" "Cyclonema" ...
 $ genus         : chr  "Anazyga" "Asaphus" "Clinopistha" "Cyclonema" ...
 $ ex            : int  1 1 0 0 0 0 0 0 0 0 ...
 $ phylum        : chr  "Brachiopoda" "Arthropoda" "Mollusca" "Mollusca" ...
 $ class         : chr  "Rhynchonellata" "Trilobita" "Bivalvia" "Gastropoda" ...
 $ order         : chr  "Atrypida" "Asaphida" "Solemyida" "Euomphalina" ...
 $ family        : chr  "Anazygidae" "Asaphidae" "Solemyidae" "Platyceratidae" ...
 $ paleolat      : num  -5.95 -5.95 -5.95 -5.95 -5.95 -5.95 -6.46 -5.95 -5.95 -5.95 ...
 $ paleolng      : num  -114 -114 -114 -114 -114 ...
 $ occurrence_no : int  1302195 1302206 1302197 1302200 1302198 1302194 1255313 1302205 1302201 1302192 ...
 $ collection_no : int  174050 174050 174050 174050 174050 174050 166420 174050 174050 174050 ...
 $ reference_no  : int  16444 16444 16444 16444 16444 16444 53404 16444 16444 16444 ...
 $ long          : num  -114 -114 -114 -114 -114 ...
 $ lat           : num  -8.98 -8.98 -8.98 -8.98 -8.98 ...
 $ cellX         : num  -1.1e+07 -1.1e+07 -1.1e+07 -1.1e+07 -1.1e+07 ...
 $ cellY         : num  -595990 -595990 -595990 -595990 -595990 ...
Code
circLocs_before <- circLocs 

After the first pulse, spatially standardized:

Code
# Extract cell number and centroid coordinates associated with each occurrence

llOccs <- vect(pbdb.after, geom = xyCartes, crs = 'epsg:4326')
prjOccs <- terra::project(llOccs, prj)
pbdb.after$cell <- cells(rPrj, prjOccs)[,'cell']
pbdb.after[, xyCell] <- xyFromCell(rPrj, 
                                pbdb.after$cell)

sdSumry(pbdb.after, taxVar = 'accepted_name', xy = xyCell, crs = prj) |> 
  print()
     nOcc nLoc centroidX centroidY latRange greatCircDist meanPairDist
[1,]  818   35 -76029.63  -2256249 96.56812      25271.42     8979.155
     minSpanTree     SCOR nTax
[1,]    52498.61 27.39688  339
Code
# Disregard SCOR 

occUniq <- uniqify(pbdb.after, xyCell)
ptsUniq <- st_as_sf(occUniq, coords = xyCell, crs = prj)


# Circular subsampling technique

circLocsAft <- cookies(dat = pbdb.after,
                    xy = xyCell,
                    iter = 100,
                    nSite = 5,
                    r = 2000,
                    weight = TRUE,
                    crs = prj,
                    output = "full")


# Convert one circular sample to sf
smplPts <- st_as_sf(circLocsAft[[1]], coords = xyCell, crs = prj)

# Convert 2000 km to m
cntr <- smplPts[1, ]
r_km <- 2000
buf <- st_buffer(cntr, dist = r_km * 1000)

# Match world map to projection
world_sf <- st_as_sf(plates.lome) |> st_transform(crs = prj)

# worldP +
#  geom_sf(data = smplPts, shape = 17, color = 'navyblue') +
#  geom_sf(data = buf, fill = NA, linewidth = 1, color = 'navyblue')

Now, for the same thing with the “after-2” age for the second pulse.

After the second pulse, spatially standardized:

Code
llOccs <- vect(pbdb.after2, geom = xyCartes, crs = 'epsg:4326')
prjOccs <- terra::project(llOccs, prj)
pbdb.after2$cell <- cells(rPrj, prjOccs)[,'cell']
pbdb.after2[, xyCell] <- xyFromCell(rPrj, 
                                pbdb.after2$cell)

sdSumry(pbdb.after2, taxVar = 'accepted_name', xy = xyCell, crs = prj) |> 
  print()
     nOcc nLoc centroidX centroidY latRange greatCircDist meanPairDist
[1,]  538   21  -1702892  -2170373 90.82533      23677.03     9144.323
     minSpanTree     SCOR nTax
[1,]    37007.92 28.70596  244
Code
# Disregard SCOR 

occUniq <- uniqify(pbdb.after2, xyCell)
ptsUniq <- st_as_sf(occUniq, coords = xyCell, crs = prj)


circLocsAft2 <- cookies(dat = pbdb.after2,
                    xy = xyCell,
                    iter = 100, # number of iterations
                    nSite = 5, # number of cells
                    r = 2000, # radial distance in km
                    weight = TRUE, 
                    crs = prj, # Equal Earth projection
                    output = 'full')


# Convert one circular sample to sf
smplPts <- st_as_sf(circLocsAft2[[1]], coords = xyCell, crs = prj)

# Convert 2000 km to m
cntr <- smplPts[1, ]
r_km <- 2000
buf <- st_buffer(cntr, dist = r_km * 1000)

# Match world map to projection
world_sf <- st_as_sf(plates.lome) |> st_transform(crs = prj)

# worldP +
#  geom_sf(data = smplPts, shape = 17, color = 'navyblue') +
#  geom_sf(data = buf, fill = NA, linewidth = 1, color = 'navyblue')

Next, we will check how many Inf values we have in sdSumry$SCOR. This won’t matter for most interval pairs, but for the P-T and LOME intervals, in which homogenization does occur, it’s a sanity check to understand why SCOR does not reflect homogenization: taxa that are present in all cells analyzed here are given an Inf score instead of a numerical value and are not included in the SCOR calculation. As per the divvy vignette walkthrough: “When a taxon is present in all sampled locations, its log probability of incidence is infinite. Infinity is nonsensical in an empirical comparison (…).”

Since we won’t use SCOR beyond this next chunk of code, the SCOR value will not affect our Jaccard calculations even for homogenized intervals, so we can still spatially standardize the data and see how that affects the Jaccard values.
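To see where the Inf values come from, here is a schematic reconstruction (not divvy’s internal code), assuming the per-taxon SCOR term takes the form -log(1 - p), where p is the proportion of sampled sites a taxon occupies:

```r
# Schematic reconstruction (not divvy's internal code) of why a taxon
# found at every sampled site yields an infinite SCOR contribution.
# Assumed per-taxon term: -log(1 - p), p = proportion of sites occupied.
p <- c(0.2, 0.6, 1.0)   # third taxon occurs at all sites
terms <- -log(1 - p)
terms                   # last term is Inf
is.infinite(sum(terms)) # TRUE: one ubiquitous taxon makes the sum infinite
```

This is why a single taxon present in every cell is enough to push a whole iteration’s SCOR to Inf.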

Code
sdbef <- sdSumry(circLocs, taxVar = 'genus', xy = xyCell, crs = prj)
sdaft <- sdSumry(circLocsAft, taxVar = 'genus', xy = xyCell, crs = prj) # many Inf SCOR values here in the P-T and LOME, indicating widespread occurrences and homogenization
sdaft2 <- sdSumry(circLocsAft2, taxVar = 'genus', xy = xyCell, crs = prj) 

is.infinite(sdaft$SCOR) # taxa that occur everywhere produce Inf SCOR values; below we replace Inf with 100 to reflect ubiquity before averaging
  [1] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE  TRUE  TRUE
 [13] FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE  TRUE  TRUE
 [25]  TRUE FALSE FALSE  TRUE  TRUE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE
 [37]  TRUE FALSE  TRUE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE
 [49] FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE
 [61] FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE
 [73] FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE
 [85]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE
 [97]  TRUE FALSE FALSE FALSE
Code
# before
sdbef_inf <- lapply(sdbef, function(x) {
  replace(x, is.infinite(x), 100)
})
sdbef_mean <- mean(sdbef_inf$SCOR)


# after:
sdaft_inf <- lapply(sdaft, function(x) {
  replace(x, is.infinite(x), 100)
})
sdaft_mean <- mean(sdaft_inf$SCOR)


# after-2:
sdaft2_inf <- lapply(sdaft2, function(x) {
  replace(x, is.infinite(x), 100)
})
sdaft2_mean <- mean(sdaft2_inf$SCOR)

Now, we will calculate the spatially standardized Jaccard similarity for each geologic age:

Code
# before.
before_jaccard <- 
  jaccard_similarity(circLocs)

# after.
after_jaccard <- 
  jaccard_similarity(circLocsAft)

# after-2.
after2_jaccard <- 
  jaccard_similarity(circLocsAft2)


#before
ave_before_jaccard <- 
  bind_rows(before_jaccard) |>
  group_by(cell_x, cell_y) |> 
  summarise(avg_jaccard = mean(jaccard_similarity)) |> 
  rename("x.cell" = "cell_x", "y.cell" = "cell_y")
`summarise()` has grouped output by 'cell_x'. You can override using the
`.groups` argument.
Code
# after
ave_after_jaccard <- 
  bind_rows(after_jaccard) |>
  group_by(cell_x, cell_y) |> 
  summarise(avg_jaccard = mean(jaccard_similarity)) |> 
  rename("x.cell" = "cell_x", "y.cell" = "cell_y")
`summarise()` has grouped output by 'cell_x'. You can override using the
`.groups` argument.
Code
# after
ave_after2_jaccard <- 
  bind_rows(after2_jaccard) |>
  group_by(cell_x, cell_y) |> 
  summarise(avg_jaccard = mean(jaccard_similarity)) |> 
  rename("x.cell" = "cell_x", "y.cell" = "cell_y")
`summarise()` has grouped output by 'cell_x'. You can override using the
`.groups` argument.

## Spatially Standardized Jaccard for the LOME

Code
# this function will get mean and 95% confidence intervals 
mean_ci <- function(x, conf = 0.95) {
  x <- as.numeric(x)
  x <- x[!is.na(x)]
  n <- length(x)
  if (n == 0) return(c(mean = NA, lower = NA, upper = NA))
  m <- mean(x)
  se <- sd(x) / sqrt(n)
  t_crit <- qt(1 - (1 - conf)/2, df = n - 1)
  ci <- t_crit * se
  c(mean = m, lower = m - ci, upper = m + ci)
}


# Add age columns
divvy_before  <- ave_before_jaccard  |> mutate(age = "Before")
divvy_after   <- ave_after_jaccard   |> mutate(age = "After")
divvy_after2  <- ave_after2_jaccard  |> mutate(age = "After-2")

# combine all data
divvy_combined <- bind_rows(divvy_before, divvy_after, divvy_after2)

# create a unique pair identifier
divvy_combined <- divvy_combined |>  mutate(pair = paste(x.cell, y.cell, sep = "_"))


# get average and 95% ci
summary_divvy <- divvy_combined |> 
  group_by(age) |> 
  summarise(
    mean = mean(avg_jaccard, na.rm = TRUE),
    lower = mean_ci(avg_jaccard)["lower"],
    upper = mean_ci(avg_jaccard)["upper"],
    .groups = "drop"
  )

# Relabel ages (grouped rows are in alphabetical order: After, After-2, Before)
summary_divvy$age <- c("Pulse-1 (Hirnantian)", "Pulse-2 (Rhuddanian)","Before (Katian)" )


# now plot!
ggplot(summary_divvy, aes(x = age, y = mean, color = age)) +
  scale_color_manual(values = c("black", "darkorange4", "goldenrod3")) +
  geom_point(size = 4) +
  geom_errorbar(aes(ymin = lower, ymax = upper), width = 0.2) +
  theme_classic(base_size = 14) +
  labs(
    title = "Spatially Standardized Global Jaccard (with 95% confidence intervals)",
    x = "Age",
    y = "Mean Jaccard Similarity"
  ) +
  theme(legend.position = "none")
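As a quick, illustrative sanity check (not part of the analysis), the t-based interval that mean_ci() computes can be reproduced by hand on a toy vector:

```r
# Reproduce the t-based 95% CI used by mean_ci() above on a toy vector
# (illustrative only).
x  <- c(1, 2, 3, 4, 5)
m  <- mean(x)                  # 3
se <- sd(x) / sqrt(length(x))  # ~0.7071
ci <- qt(0.975, df = length(x) - 1) * se
c(mean = m, lower = m - ci, upper = m + ci) # ~ 3.00, 1.04, 4.96
```

With only five observations the t critical value (about 2.78) is noticeably wider than the normal 1.96, which is why mean_ci() uses qt() rather than a fixed multiplier.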

As a final note, what we just analyzed is the average global Jaccard value of the spatially standardized data. This plot does not separate similarity as a function of distance the way the main calculations do; each point is a single global value, computed after spatially subsampling 100 times with the cookie method, and it agrees with our earlier Jaccard calculations. The Results calculated earlier also show that most of the lowered similarity comes from cells that are farther apart from one another and that, for this interval, similarity can vary regionally. By using both methods, we can account for regional differences and for differences in the dispersion of cells.
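The distance dependence mentioned above can be sketched in base R; the coordinates and Jaccard values below are hypothetical, and the great-circle step mirrors our gcd.slc() helper:

```r
# Sketch of binning cell-pair similarity by great-circle distance
# (coordinates and Jaccard values are hypothetical).
pairs <- data.frame(
  long1 = c(-114, -114, 20), lat1 = c(-6, -6, -30),
  long2 = c(-110,   30, 25), lat2 = c(-8, 10, -28),
  jaccard = c(0.45, 0.08, 0.40)
)

# Great-circle distance in km via the spherical law of cosines,
# as in gcd.slc() defined earlier
R <- 6371
to_rad <- function(d) d * pi / 180
pairs$dist_km <- with(pairs, acos(
  sin(to_rad(lat1)) * sin(to_rad(lat2)) +
  cos(to_rad(lat1)) * cos(to_rad(lat2)) * cos(to_rad(long2 - long1))
) * R)

# Nearby pairs tend to be more similar; binning makes the trend explicit
pairs$bin <- cut(pairs$dist_km, breaks = c(0, 5000, 25000))
aggregate(jaccard ~ bin, data = pairs, FUN = mean)
```

Applied to real cell pairs, this kind of binning is what separates the regional (near) from the global (far) contribution to the similarity signal.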